12/11/2018
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
##  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
Ticket class (\(Pclass\)): 1 = upper, 2 = middle, 3 = lower
Port of embarkation (\(Embarked\)): C = Cherbourg, Q = Queenstown, S = Southampton
We first check each variable for missing (NA) values.
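A minimal sketch of such a check, run on a small made-up data frame (`toy` and its values are illustrative, not the real data). It counts true `NA` values and, separately, empty strings, since the structure output above shows an empty-string level `""` in \(Cabin\) and \(Embarked\) that `is.na()` would not catch.

```r
# Toy stand-in for the training data; the values are illustrative only.
toy <- data.frame(
  Age      = c(22, 38, NA, 35, NA),
  Embarked = c("S", "C", "", "S", "Q"),
  Cabin    = c("", "C85", "", "C123", ""),
  stringsAsFactors = FALSE
)

# Count true NA values per column.
na_counts <- colSums(is.na(toy))

# Count empty strings per column; these are not NA, so is.na() misses them.
empty_counts <- sapply(toy, function(col) sum(!is.na(col) & col == ""))

print(na_counts)
print(empty_counts)
```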
We replace the missing values in \(Embarked\) with the most frequent value, "S".
We replace the missing values in \(Age\) with the mean of \(Age\).
We drop \(Cabin\) because a large percentage of its values are missing and we consider it the least important variable.
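The three cleaning steps can be sketched as follows. This is an illustration on a made-up data frame (`toy` is hypothetical), and it assumes missing ports may appear either as `NA` or as empty strings.

```r
# Toy stand-in for the training data; the values are illustrative only.
toy <- data.frame(
  Age      = c(22, 38, NA, 35, NA),
  Embarked = c("S", "C", "", "S", "Q"),
  Cabin    = c("", "C85", "", "C123", ""),
  stringsAsFactors = FALSE
)

# Most frequent non-missing port of embarkation.
ports <- toy$Embarked[!is.na(toy$Embarked) & toy$Embarked != ""]
mode_port <- names(which.max(table(ports)))

# Replace missing ports (NA or empty string) with the most frequent one.
toy$Embarked[is.na(toy$Embarked) | toy$Embarked == ""] <- mode_port

# Replace missing ages with the mean of the observed ages.
toy$Age[is.na(toy$Age)] <- mean(toy$Age, na.rm = TRUE)

# Drop the Cabin column entirely.
toy$Cabin <- NULL

print(toy)
```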
Classification criterion for \(FamilySize\)
Single: family members = 0
Small: family members = 1 or 2
Big: family members > 2
##    Big Single  Small
##     91    537    263
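One way to build such a grouping is with `cut()`. The sketch below assumes that "family member" means `SibSp + Parch`, which the slides do not state, and uses made-up counts. Converting through character gives alphabetical factor levels, which would make Big the baseline level, as in the table above.

```r
# Illustrative sibling/spouse and parent/child counts (not the real data).
sibsp <- c(1, 0, 0, 3, 1, 4)
parch <- c(0, 0, 2, 1, 1, 2)

# Assumption: family members = SibSp + Parch.
members <- sibsp + parch

# Intervals are (-Inf, 0], (0, 2], (2, Inf] because cut() is right-closed by default.
family_size <- cut(
  members,
  breaks = c(-Inf, 0, 2, Inf),
  labels = c("Single", "Small", "Big")
)

# Re-create the factor so the levels are alphabetical: Big, Single, Small.
family_size <- factor(as.character(family_size))

print(table(family_size))
```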
Classification criterion for \(AgeGroup\)
Child: age <= 6
Juvenile: 6 < age <= 17
Youth: 17 < age <= 40
MiddleAge: 40 < age <= 65
Senium: age > 65
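A sketch of the age grouping with `cut()`, on made-up ages. It assumes the bins are contiguous, so that MiddleAge begins immediately above 40 and no age falls between two groups.

```r
# Illustrative ages, chosen to sit on and around the bin edges.
ages <- c(2, 6, 14, 17, 22, 40, 40.5, 65, 70)

# Right-closed intervals: (-Inf, 6], (6, 17], (17, 40], (40, 65], (65, Inf].
age_group <- cut(
  ages,
  breaks = c(-Inf, 6, 17, 40, 65, Inf),
  labels = c("Child", "Juvenile", "Youth", "MiddleAge", "Senium")
)

print(data.frame(ages, age_group))
```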
We select \(Survived\) as the dependent variable.
Because \(Survived\) is binary, we use logistic regression.
##
## Call:
## glm(formula = Survived ~ Sex + AgeGroup + FamilySize + Pclass +
##     Embarked, family = binomial(link = "logit"), data = train)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.6809  -0.6706  -0.4057   0.6120   2.4845
##
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)
## (Intercept)        4.00043    0.54265   7.372 1.68e-13 ***
## Sexmale           -2.78512    0.20785 -13.400  < 2e-16 ***
## AgeGroupJuvenile  -2.07572    0.55559  -3.736 0.000187 ***
## AgeGroupMiddleAge -3.10495    0.51992  -5.972 2.34e-09 ***
## AgeGroupSenium    -3.84666    1.20691  -3.187 0.001437 **
## AgeGroupYouth     -2.52134    0.46818  -5.385 7.23e-08 ***
## FamilySizeSingle   1.44313    0.35590   4.055 5.02e-05 ***
## FamilySizeSmall    1.52340    0.35129   4.337 1.45e-05 ***
## Pclass2           -0.98748    0.26969  -3.662 0.000251 ***
## Pclass3           -2.15825    0.25233  -8.553  < 2e-16 ***
## EmbarkedQ         -0.04071    0.38457  -0.106 0.915685
## EmbarkedS         -0.43479    0.24216  -1.796 0.072571 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  764.59  on 879  degrees of freedom
## AIC: 788.59
##
## Number of Fisher Scoring iterations: 5
##       (Intercept)           Sexmale  AgeGroupJuvenile AgeGroupMiddleAge
##       54.62162288        0.06172143        0.12546658        0.04482697
##    AgeGroupSenium     AgeGroupYouth  FamilySizeSingle   FamilySizeSmall
##        0.02135101        0.08035183        4.23394352        4.58780627
##           Pclass2           Pclass3         EmbarkedQ         EmbarkedS
##        0.37251559        0.11552692        0.96010289        0.64739747
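The second block of output matches the exponentiated coefficients, i.e. odds ratios: for example, exp(-2.78512) is about 0.0617 for Sexmale. The sketch below shows the same two steps, fitting a logistic regression with `glm()` and exponentiating its coefficients, on a tiny made-up data set (`toy` is hypothetical, with only \(Sex\) as a predictor).

```r
# Toy data: 8 of 10 females survive, 2 of 10 males survive (illustrative only).
toy <- data.frame(
  Survived = c(rep(1, 8), rep(0, 2), rep(1, 2), rep(0, 8)),
  Sex      = factor(rep(c("female", "male"), each = 10))
)

# Logistic regression: log-odds of survival as a linear function of Sex.
fit <- glm(Survived ~ Sex, family = binomial(link = "logit"), data = toy)

# Exponentiating a coefficient turns a log-odds difference into an odds ratio.
odds_ratios <- exp(coef(fit))

print(odds_ratios)
```

In this toy example the odds of survival are 8/2 = 4 for females and 2/8 = 0.25 for males, so the odds ratio for Sexmale is 0.25 / 4 = 0.0625. A value below 1 means lower odds of survival than the baseline level.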
subtrain dataset (80%)
subvalidation dataset (20%)
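A random 80/20 split can be sketched as below. The seed, the toy data frame, and the use of `sample()` are assumptions for illustration; the slides do not show how the split was actually made.

```r
# Hypothetical seed so the split is reproducible.
set.seed(2018)

# Toy stand-in for the training data (illustrative only).
toy <- data.frame(id = 1:10, y = rep(0:1, 5))

# Draw 80% of the row indices at random, without replacement.
n <- nrow(toy)
train_idx <- sample(seq_len(n), size = round(0.8 * n))

subtrain <- toy[train_idx, ]   # 80% used to fit the model
subvalid <- toy[-train_idx, ]  # remaining 20% held out for validation

print(nrow(subtrain))
print(nrow(subvalid))
```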
##
## Call:
## glm(formula = Survived ~ Sex + AgeGroup + FamilySize + Pclass,
##     family = binomial(link = "logit"), data = subtrain)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.5330  -0.6091  -0.4544   0.5860   2.4386
##
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)
## (Intercept)         3.1667     0.5613   5.642 1.68e-08 ***
## Sexmale            -2.7487     0.2244 -12.250  < 2e-16 ***
## AgeGroupJuvenile   -1.6740     0.6380  -2.624 0.008694 **
## AgeGroupMiddleAge  -2.9221     0.6084  -4.803 1.56e-06 ***
## AgeGroupSenium     -3.2328     1.2669  -2.552 0.010721 *
## AgeGroupYouth      -2.2197     0.5537  -4.009 6.09e-05 ***
## FamilySizeSingle    1.7006     0.4160   4.088 4.35e-05 ***
## FamilySizeSmall     1.7828     0.4101   4.347 1.38e-05 ***
## Pclass2            -0.9727     0.2893  -3.362 0.000773 ***
## Pclass3            -2.1174     0.2670  -7.931 2.17e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 954.46  on 712  degrees of freedom
## Residual deviance: 632.72  on 703  degrees of freedom
## AIC: 652.72
##
## Number of Fisher Scoring iterations: 5
##       (Intercept)           Sexmale  AgeGroupJuvenile AgeGroupMiddleAge
##       23.72803425        0.06401285        0.18748912        0.05382059
##    AgeGroupSenium     AgeGroupYouth  FamilySizeSingle   FamilySizeSmall
##        0.03944646        0.10864363        5.47733010        5.94652228
##           Pclass2           Pclass3
##        0.37804731        0.12034388
## [1] "Accuracy 0.837988826815642"
## [1] 0.8818614
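An accuracy figure like the one reported above can be computed by thresholding predicted probabilities and comparing them with the observed outcomes. The sketch below uses made-up probabilities and labels, and the 0.5 cutoff is an assumption; the slides do not state which threshold was used.

```r
# Illustrative predicted survival probabilities and true outcomes.
probs  <- c(0.91, 0.20, 0.65, 0.40, 0.08, 0.77)
actual <- c(1, 0, 0, 1, 0, 1)

# Assumed cutoff: predict survival when the probability exceeds 0.5.
predicted <- ifelse(probs > 0.5, 1, 0)

# Accuracy is the share of predictions that match the true outcome.
accuracy <- mean(predicted == actual)

print(paste("Accuracy", accuracy))
```

With a fitted model, the probabilities would typically come from `predict(fit, newdata = ..., type = "response")` on the held-out validation rows.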
Thank you!